
add resource journal #6586

Merged: 7 commits merged into flux-framework:master from resource_journal on Feb 6, 2025

Conversation

@garlick (Member) commented Jan 28, 2025

This adds a resource journal streaming RPC similar to the one offered by the job manager. It's just a first cut at this point.

This doesn't change what's posted to the persistent resource.eventlog in the KVS, but it does add one new event, restart, that is only for journal consumption. It provides a baseline for mapping execution targets to hostnames in the current instance and sets the initial online set after a restart.

Unlike the job manager journal, this one doesn't have as much volume to deal with, so no options for event filtering or for skipping historical data are provided yet.

flux resource eventlog can be used to dump and optionally follow this log. It is not polished at all yet; it just dumps the events in JSON form, one per line.
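
For illustration only (not part of this PR), a minimal sketch of a consumer doing the same thing directly against the RPC, assuming the usual flux-core streaming RPC conventions, that is one JSON response per journal batch with the stream terminated by an ENODATA error response; the payload layout itself is defined by the proposed RFC, so the sketch just prints each raw response:

/* Illustrative sketch, not part of this PR: consume resource.journal
 * directly.  Assumes the usual flux-core streaming RPC conventions
 * (one response per batch, stream terminated by ENODATA); the payload
 * layout is whatever the proposed RFC specifies, so it is printed raw.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <flux/core.h>

int main (void)
{
    flux_t *h;
    flux_future_t *f;
    const char *s;

    if (!(h = flux_open (NULL, 0)))
        return 1;
    if (!(f = flux_rpc (h, "resource.journal", NULL,
                        FLUX_NODEID_ANY, FLUX_RPC_STREAMING)))
        return 1;
    while (flux_rpc_get (f, &s) == 0) {
        printf ("%s\n", s);        /* one raw JSON response per line */
        flux_future_reset (f);     /* re-arm the future for the next response */
    }
    if (errno != ENODATA)          /* ENODATA marks normal end of stream */
        fprintf (stderr, "resource.journal: %s\n", strerror (errno));
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}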

For more detail on what's in this log and how the journal is formatted, see the proposed RFC:

@garlick force-pushed the resource_journal branch 3 times, most recently from 5e12001 to 3ff612f, January 31, 2025 00:48
@garlick (Member, Author) commented Jan 31, 2025

I added a quick --wait=EVENT option to flux resource eventlog and a sharness test. I'll remove WIP; maybe this is good enough for a start that we can experiment with a bit?

@garlick changed the title from "WIP: add resource journal" to "add resource journal" on Jan 31, 2025
@grondo (Contributor) commented Jan 31, 2025

Yeah, let me see if I can generalize the JournalConsumer class to also work with the resource eventlog.

@garlick (Member, Author) commented Feb 4, 2025

Did we want to add this to 0.71 as an experimental feature?

@grondo (Contributor) commented Feb 4, 2025

I was halfway through reviewing it. I have a few commits that abstract the current Python JournalConsumer class so that a ResourceJournalConsumer class can be added, and I was in the middle of writing some tests.

Have we estimated how large the in-memory history of all events would be on something like Elcap with a long uptime?

@garlick (Member, Author) commented Feb 4, 2025

That might be worrisome actually, now that you mention it. Perhaps we'll need some way to cull non-persistent events (i.e., those other than drain/undrain) that are older than a threshold. It might be a bit too last-minute to add that here, sadly.

@grondo (Contributor) commented Feb 4, 2025

I'm not sure there's a rush to add this as experimental, because I'm not sure there's an immediate use case without the Python consumer class (unless I'm forgetting something!).

@garlick (Member, Author) commented Feb 4, 2025

I think this is a priority for @kkier, but there is little advantage in putting this in as-is; in fact, it may cause problems. Let's try to deliver a good first step in the next release.

@kkier (Contributor) commented Feb 4, 2025

Just confirming: this is a priority for me (and the people I report to on both sides of the aisle, as it were), but if it ends up in an off-cycle release that I have to manually distribute, that's a thing I can deal with.

@garlick (Member, Author) commented Feb 4, 2025

> but if it ends up in an off-cycle release that I have to manually distribute, that's a thing I can deal with.

To be clear, although it's not that hard on our end to produce one-off tags and packages, this particular change modifies a core service and would require downtime to distribute.

@grondo (Contributor) commented Feb 4, 2025

One thing I've noticed in my testing is that canceling the journal RPC with resource.journal-cancel doesn't seem to result in ENODATA eventually being sent to the client. This differs from the job manager journal service and would slightly complicate the implementation of Python interfaces with a shared base class.

Of course it is possible I've made a mistake in testing.
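
For reference, the flow being described looks roughly like the sketch below, assuming the resource journal mirrors the job manager journal's cancel protocol (the matchtag payload key and the ENODATA termination are carried over from that protocol, not confirmed for this RPC, and the journal_cancel_and_drain helper is hypothetical): the client sends resource.journal-cancel with the matchtag of its streaming request, then keeps reading responses until the module ends the stream with ENODATA.

/* Hypothetical helper, sketch only.  Assumes the job-manager-style
 * cancel protocol applies: the cancel request carries the matchtag of
 * the streaming resource.journal RPC (journal_f), and the server then
 * terminates that stream with an ENODATA error response.
 */
#include <errno.h>
#include <flux/core.h>

static int journal_cancel_and_drain (flux_t *h, flux_future_t *journal_f)
{
    flux_future_t *f;
    const char *s;

    if (!(f = flux_rpc_pack (h,
                             "resource.journal-cancel",
                             FLUX_NODEID_ANY,
                             FLUX_RPC_NORESPONSE,
                             "{s:i}",
                             "matchtag",
                             (int)flux_rpc_get_matchtag (journal_f))))
        return -1;
    flux_future_destroy (f);

    /* Responses already in flight still arrive; the stream should then
     * end with ENODATA once the cancel is processed.
     */
    while (flux_rpc_get (journal_f, &s) == 0)
        flux_future_reset (journal_f);
    return errno == ENODATA ? 0 : -1;
}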

@kkier (Contributor) commented Feb 4, 2025

> but if it ends up in an off-cycle release that I have to manually distribute, that's a thing I can deal with.

> To be clear, although it's not that hard on our end to produce one-off tags and packages, this particular change modifies a core service and would require downtime to distribute.

100% understood, it is what it is.

@garlick (Member, Author) commented Feb 4, 2025

> canceling the journal RPC with reosurce.journal-cancel doesn't seem to result in ENODATA eventually being sent to the client

It should (if spelled correctly :-). However, I didn't add a test for that, so I should verify.

@grondo (Contributor) commented Feb 4, 2025

I think I stumbled across the problem:

diff --git a/src/modules/resource/reslog.c b/src/modules/resource/reslog.c
index 61aae1218..513d54cc6 100644
--- a/src/modules/resource/reslog.c
+++ b/src/modules/resource/reslog.c
@@ -328,7 +328,7 @@ static void journal_cb (flux_t *h,
         errstr = "journal requires streaming RPC flag";
         goto error;
     }
-    if (send_backlog (reslog, msg))
+    if (!send_backlog (reslog, msg))
         return;
     if (flux_msglist_append (reslog->consumers, msg) < 0)
         goto error;

@garlick (Member, Author) commented Feb 5, 2025

Aww, I tested that. I must have broken it with a last-minute change. I need to add a test too :-(

Commit messages:

Problem: the reslog class has no way to access the resource inventory,
but it will be useful to send journal consumers a copy of R when
the resource-define event is emitted.

Pass the resource_ctx to reslog_create() instead of just the flux_t
handle.  Adjust internal uses of the flux_t handle to get it via
reslog->ctx->h instead of reslog->h.

Problem: the full resource eventlog, including online/offline
events that are not committed to the KVS, may need to be monitored.

Keep events in a json array in memory, including the events that
were read from the KVS at startup, if any.

Filter out any historical resource-define events.  These are meant for
synchronization on the availability of R and that only pertains to the
current instance.

Problem: there is no way to observe the journal in real time, with
non-persistent online/offline events included.

Add a resource.journal RPC with protocol similar to the job manager
journal.

Problem: a resource journal consumer will get online/offline events
before knowing the size of the instance or the hostname mapping.

Post a 'restart' event when the resource module is loaded with
the following keys:

ranks
  An idset containing all valid ranks: 0 to size-1

online
  An idset containing any ranks that are initially online.
  This is normally empty except when starting with monitor-force-up in test.

nodelist
  Contents of the hostlist broker attribute

This event is not made persistent in the KVS resource.eventlog.

Problem: there is no convenient tool for accessing the resource journal.

Add flux resource eventlog [--follow] [--wait=EVENT].

Problem: there are no tests for the resource journal or the
flux resource eventlog command.

Add a sharness test for this purpose.

Problem: flux resource eventlog has no documentation.

Add an entry to the man page.
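
For illustration only (none of this is in the PR), a consumer that wants to seed its host mapping from the restart event described above might pick the context apart something like this, assuming journal events arrive as RFC 18 style objects with timestamp, name, and context keys (the parse_restart helper is hypothetical):

/* Hypothetical helper: decode the context of a 'restart' journal event.
 * Assumes RFC 18 style event objects: {"timestamp":f, "name":s, "context":{}}.
 */
#include <string.h>
#include <jansson.h>

static int parse_restart (json_t *event)
{
    double timestamp;
    const char *name;
    const char *ranks;     /* idset of all valid ranks: 0 to size-1 */
    const char *online;    /* idset of ranks initially online (normally empty) */
    const char *nodelist;  /* contents of the hostlist broker attribute */

    if (json_unpack (event,
                     "{s:F s:s s:{s:s s:s s:s}}",
                     "timestamp", &timestamp,
                     "name", &name,
                     "context",
                       "ranks", &ranks,
                       "online", &online,
                       "nodelist", &nodelist) < 0
        || strcmp (name, "restart") != 0)
        return -1;
    /* ranks and nodelist give the execution target to hostname mapping;
     * online gives the initial online set.
     */
    return 0;
}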
@garlick (Member, Author) commented Feb 6, 2025

Ok, some small changes:

  • fixed the bug @grondo pointed out
  • renamed -f to -F to match the Python class (for future work by @grondo)
  • covered -F in a test
  • rebased on current master

@grondo (Contributor) left a comment

This LGTM!

@garlick (Member, Author) commented Feb 6, 2025

Thanks! I'll set MWP.

The mergify bot merged commit 496a332 into flux-framework:master on Feb 6, 2025
34 of 35 checks passed

codecov bot commented Feb 6, 2025

Codecov Report

Attention: Patch coverage is 68.15287% with 50 lines in your changes missing coverage. Please review.

Project coverage is 79.50%. Comparing base (1fbd1dd) to head (6769b2e).
Report is 8 commits behind head on master.

Files with missing lines        | Patch %  | Lines
src/modules/resource/reslog.c   | 60.37%   | 42 Missing ⚠️
src/cmd/flux-resource.py        | 81.48%   | 5 Missing ⚠️
src/modules/resource/monitor.c  | 90.00%   | 2 Missing ⚠️
src/modules/resource/resource.c | 75.00%   | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6586      +/-   ##
==========================================
- Coverage   79.53%   79.50%   -0.04%     
==========================================
  Files         531      531              
  Lines       88213    88354     +141     
==========================================
+ Hits        70160    70242      +82     
- Misses      18053    18112      +59     
Files with missing lines        | Coverage Δ
src/modules/resource/upgrade.c  | 67.44% <ø> (ø)
src/modules/resource/resource.c | 86.66% <75.00%> (+0.13%) ⬆️
src/modules/resource/monitor.c  | 70.00% <90.00%> (+2.50%) ⬆️
src/cmd/flux-resource.py        | 94.26% <81.48%> (-0.78%) ⬇️
src/modules/resource/reslog.c   | 69.87% <60.37%> (-4.96%) ⬇️

... and 7 files with indirect coverage changes

@garlick deleted the resource_journal branch on February 6, 2025 03:18